Recommender Systems

In this project, we build a movie recommender system. We read a dataset of movie ratings by users, then find other movies that a user would likely be interested in, based on the movies they have already rated.



In [3]:

    
import numpy as np
import pandas as pd

Read the data



In [4]:

    
column_names = ['user_id', 'item_id', 'rating', 'timestamp']
df = pd.read_csv('u.data', sep='\t', names=column_names)



In [5]:

    
df.head()

Get movie titles



In [6]:

    
movie_titles = pd.read_csv("Movie_Id_Titles")
movie_titles.head()

    Out[6]:

   item_id              title
0        1   Toy Story (1995)
1        2   GoldenEye (1995)
2        3  Four Rooms (1995)
3        4  Get Shorty (1995)
4        5     Copycat (1995)

Merge the ratings with the movie titles on item_id



In [7]:

    
df = pd.merge(df,movie_titles,on='item_id')
df.head()

    Out[7]:

   user_id  item_id  rating  timestamp             title
0        0       50       5  881250949  Star Wars (1977)
1      290       50       5  880473582  Star Wars (1977)
2       79       50       4  891271545  Star Wars (1977)
3        2       50       5  888552084  Star Wars (1977)
4        8       50       5  879362124  Star Wars (1977)
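
As a sanity check on a merge like this one, we can ask pandas to verify that each item_id maps to a single title. The sketch below is only an illustration: toy_ratings and toy_titles are made-up stand-ins for the two tables (their values are assumptions, not the real files), and it relies on the validate argument of pd.merge.

```python
import pandas as pd

# Toy stand-ins for the ratings and title tables (illustrative values only).
toy_ratings = pd.DataFrame({
    'user_id': [0, 290, 79, 0],
    'item_id': [50, 50, 50, 1],
    'rating':  [5, 5, 4, 3],
})
toy_titles = pd.DataFrame({
    'item_id': [1, 50],
    'title': ['Toy Story (1995)', 'Star Wars (1977)'],
})

# validate='many_to_one' raises pandas.errors.MergeError if an item_id
# appears more than once in the right-hand (titles) table.
toy_merged = pd.merge(toy_ratings, toy_titles, on='item_id', validate='many_to_one')

# With a complete title table, the default inner merge keeps every rating row.
print(len(toy_merged) == len(toy_ratings))
print(toy_merged)
```

If the title table had duplicate ids, the merge would silently multiply rating rows without this check, which would inflate the rating counts used later.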

Exploratory Data Analysis



In [8]:

    
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('white')
%matplotlib inline

Create a ratings dataframe with average rating and number of ratings



In [9]:

    
df.groupby('title')['rating'].mean().sort_values(ascending=False).head()









    Out[9]:





title
Marlene Dietrich: Shadow and Light (1996)     5.0
Prefontaine (1997)                            5.0
Santa with Muscles (1996)                     5.0
Star Kid (1997)                               5.0
Someone Else's America (1995)                 5.0
Name: rating, dtype: float64



In [10]:

    
df.groupby('title')['rating'].count().sort_values(ascending=False).head()









    Out[10]:





title
Star Wars (1977)             584
Contact (1997)               509
Fargo (1996)                 508
Return of the Jedi (1983)    507
Liar Liar (1997)             485
Name: rating, dtype: int64



In [11]:

    
ratings = pd.DataFrame(df.groupby('title')['rating'].mean())
ratings.head()

    Out[11]:

                             rating
title
'Til There Was You (1997)  2.333333
1-900 (1994)               2.600000
101 Dalmatians (1996)      2.908257
12 Angry Men (1957)        4.344000
187 (1997)                 3.024390

Add a column with the number of ratings per movie



In [12]:

    
ratings['num of ratings'] = df.groupby('title')['rating'].count()
ratings.head()

    Out[12]:

                             rating  num of ratings
title
'Til There Was You (1997)  2.333333               9
1-900 (1994)               2.600000               5
101 Dalmatians (1996)      2.908257             109
12 Angry Men (1957)        4.344000             125
187 (1997)                 3.024390              41
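
The two columns above can also be built in a single groupby pass. This is a minimal sketch on made-up data (the toy frame and its values are assumptions for illustration), using agg with a list of function names and then renaming the columns to match the ones used in this notebook.

```python
import pandas as pd

# Toy ratings (illustrative values, not taken from u.data).
toy = pd.DataFrame({
    'title':  ['A', 'A', 'A', 'B', 'B'],
    'rating': [5, 4, 3, 2, 4],
})

# Compute the mean and the count together, then rename the columns
# so they match the names used for the ratings dataframe above.
toy_summary = (
    toy.groupby('title')['rating']
       .agg(['mean', 'count'])
       .rename(columns={'mean': 'rating', 'count': 'num of ratings'})
)
print(toy_summary)
```

Either approach should give the same table; the single-pass version just avoids grouping the data twice.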

Data Visualization: Histogram



In [13]:

    
plt.figure(figsize=(10,4))
ratings['num of ratings'].hist(bins=70)

    Out[13]:

[Figure: histogram of the number of ratings per movie, 70 bins]



In [14]:

    
plt.figure(figsize=(10,4))
ratings['rating'].hist(bins=70)

    Out[14]:

[Figure: histogram of the average rating per movie, 70 bins]



In [15]:

    
sns.jointplot(x='rating',y='num of ratings',data=ratings,alpha=0.5)

    Out[15]:

[Figure: joint plot of average rating (x) against number of ratings (y)]

Recommending Similar Movies



In [16]:

    
moviemat = df.pivot_table(index='user_id',columns='title',values='rating')
moviemat.head()

    Out[16]:

title    'Til There Was You (1997)  1-900 (1994)  101 Dalmatians (1996)  12 Angry Men (1957)  \
user_id
0                              NaN           NaN                    NaN                  NaN
1                              NaN           NaN                    2.0                  5.0
2                              NaN           NaN                    NaN                  NaN
3                              NaN           NaN                    NaN                  NaN
4                              NaN           NaN                    NaN                  NaN

title    187 (1997)  2 Days in the Valley (1996)  20,000 Leagues Under the Sea (1954)  \
user_id
0               NaN                          NaN                                  NaN
1               NaN                          NaN                                  3.0
2               NaN                          NaN                                  NaN
3               2.0                          NaN                                  NaN
4               NaN                          NaN                                  NaN

title    2001: A Space Odyssey (1968)  3 Ninjas: High Noon At Mega Mountain (1998)  \
user_id
0                                 NaN                                          NaN
1                                 4.0                                          NaN
2                                 NaN                                          1.0
3                                 NaN                                          NaN
4                                 NaN                                          NaN

title    39 Steps, The (1935)  ...  Yankee Zulu (1994)  Year of the Horse (1997)  \
user_id
0                         NaN  ...                 NaN                       NaN
1                         NaN  ...                 NaN                       NaN
2                         NaN  ...                 NaN                       NaN
3                         NaN  ...                 NaN                       NaN
4                         NaN  ...                 NaN                       NaN

title    You So Crazy (1994)  Young Frankenstein (1974)  Young Guns (1988)  Young Guns II (1990)  \
user_id
0                        NaN                        NaN                NaN                   NaN
1                        NaN                        5.0                3.0                   NaN
2                        NaN                        NaN                NaN                   NaN
3                        NaN                        NaN                NaN                   NaN
4                        NaN                        NaN                NaN                   NaN

title    Young Poisoner's Handbook, The (1995)  Zeus and Roxanne (1997)  unknown  \
user_id
0                                          NaN                      NaN      NaN
1                                          NaN                      NaN      4.0
2                                          NaN                      NaN      NaN
3                                          NaN                      NaN      NaN
4                                          NaN                      NaN      NaN

title    Á köldum klaka (Cold Fever) (1994)
user_id
0                                       NaN
1                                       NaN
2                                       NaN
3                                       NaN
4                                       NaN

5 rows × 1664 columns
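
To see what pivot_table does here, it can help to run it on a tiny example. The sketch below uses a made-up ratings frame (toy and its values are assumptions for illustration): each user becomes a row, each title a column, and any movie a user has not rated shows up as NaN.

```python
import pandas as pd

# Toy long-format ratings: one row per (user, title) pair (illustrative values).
toy = pd.DataFrame({
    'user_id': [1, 1, 2, 3],
    'title':   ['A', 'B', 'A', 'B'],
    'rating':  [5, 3, 4, 2],
})

# Reshape to a user x title matrix; missing (user, title) pairs become NaN.
toy_mat = toy.pivot_table(index='user_id', columns='title', values='rating')
print(toy_mat)

# Fraction of empty cells, i.e. how sparse the matrix is.
sparsity = toy_mat.isna().to_numpy().mean()
print(sparsity)
```

Note that pivot_table aggregates with the mean by default, so if a user had rated the same title more than once, the cell would hold the average of those ratings.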

Most rated movies



In [17]:

    
ratings.sort_values('num of ratings',ascending=False).head(10)

    Out[17]:

                                 rating  num of ratings
title
Star Wars (1977)               4.359589             584
Contact (1997)                 3.803536             509
Fargo (1996)                   4.155512             508
Return of the Jedi (1983)      4.007890             507
Liar Liar (1997)               3.156701             485
English Patient, The (1996)    3.656965             481
Scream (1996)                  3.441423             478
Toy Story (1995)               3.878319             452
Air Force One (1997)           3.631090             431
Independence Day (ID4) (1996)  3.438228             429

We choose two movies: Star Wars, a sci-fi movie, and Liar Liar, a comedy.



In [18]:

    
ratings.head()

    Out[18]:

                             rating  num of ratings
title
'Til There Was You (1997)  2.333333               9
1-900 (1994)               2.600000               5
101 Dalmatians (1996)      2.908257             109
12 Angry Men (1957)        4.344000             125
187 (1997)                 3.024390              41

Now let's grab the user ratings for those two movies:



In [19]:

    
starwars_user_ratings = moviemat['Star Wars (1977)']
liarliar_user_ratings = moviemat['Liar Liar (1997)']
starwars_user_ratings.head()









    Out[19]:





user_id
0    5.0
1    5.0
2    5.0
3    NaN
4    5.0
Name: Star Wars (1977), dtype: float64

Use the corrwith() method to correlate every movie column in moviemat with the ratings Series of each chosen movie:



In [20]:

    
similar_to_starwars = moviemat.corrwith(starwars_user_ratings)
similar_to_liarliar = moviemat.corrwith(liarliar_user_ratings)









    



C:\Users\Luiz Henrique\AppData\Roaming\Python\Python35\site-packages\numpy\lib\function_base.py:2487: RuntimeWarning: Degrees of freedom <= 0 for slice
  warnings.warn("Degrees of freedom <= 0 for slice", RuntimeWarning)
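
A small example may clarify what corrwith computes. In this sketch (toy_mat and its values are made up for illustration), each column is correlated with the target column using only the rows where both have a rating, which we check against a manual np.corrcoef on the overlapping rows. Column C shares just one rater with A, so its correlation is undefined (NaN); NumPy may emit a "Degrees of freedom <= 0" RuntimeWarning in that situation, which appears to be the source of the warning in the output above.

```python
import warnings

import numpy as np
import pandas as pd

# Toy user x title matrix with missing ratings (illustrative values).
toy_mat = pd.DataFrame({
    'A': [5.0, 4.0, 1.0, 2.0, np.nan],
    'B': [4.0, 5.0, 2.0, np.nan, 3.0],
    'C': [np.nan, np.nan, np.nan, 3.0, 4.0],
})
target = toy_mat['A']

# Suppress the RuntimeWarning that an undefined correlation can trigger.
with warnings.catch_warnings():
    warnings.simplefilter('ignore', RuntimeWarning)
    toy_corr = toy_mat.corrwith(target)
print(toy_corr)

# Manual check for column B: keep only users who rated both A and B.
both = toy_mat[['A', 'B']].dropna()
manual = np.corrcoef(both['A'], both['B'])[0, 1]
print(manual)
```

The NaN entries produced this way are the ones removed by dropna() in the next step.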

Clean the data by removing NaN values and using a DataFrame instead of a Series:



In [21]:

    
corr_starwars = pd.DataFrame(similar_to_starwars,columns=['Correlation'])
corr_starwars.dropna(inplace=True)
corr_starwars.head()

    Out[21]:

                           Correlation
title
'Til There Was You (1997)     0.872872
1-900 (1994)                 -0.645497
101 Dalmatians (1996)         0.211132
12 Angry Men (1957)           0.184289
187 (1997)                    0.027398



In [22]:

    
corr_starwars.sort_values('Correlation',ascending=False).head(10)

    Out[22]:

                                                                                   Correlation
title
Hollow Reed (1996)                                                                         1.0
Stripes (1981)                                                                             1.0
Star Wars (1977)                                                                           1.0
Man of the Year (1995)                                                                     1.0
Beans of Egypt, Maine, The (1994)                                                          1.0
Safe Passage (1994)                                                                        1.0
Old Lady Who Walked in the Sea, The (Vieille qui marchait dans la mer, La) (1991)          1.0
Outlaw, The (1943)                                                                         1.0
Line King: Al Hirschfeld, The (1996)                                                       1.0
Hurricane Streets (1998)                                                                   1.0
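
One plausible reason so many little-known titles reach a correlation of exactly 1.0 is that very few users rated both that movie and Star Wars. With only two shared raters whose ratings differ, the Pearson correlation is always +1 or -1, whatever the actual ratings are. The sketch below shows this with made-up ratings from two hypothetical users (ids 10 and 20).

```python
import pandas as pd

# Ratings from two hypothetical users (ids 10 and 20) for three movies.
star = pd.Series([5.0, 3.0], index=[10, 20])
rare_same = pd.Series([4.0, 2.0], index=[10, 20])      # moves in the same direction
rare_opposite = pd.Series([2.0, 4.0], index=[10, 20])  # moves in the opposite direction

# Two points always lie on a straight line, so the correlation is +1 or -1.
r_same = star.corr(rare_same)
r_opposite = star.corr(rare_opposite)
print(r_same, r_opposite)
```

A perfect score from two raters says very little about how similar two movies really are.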

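Keep only movies that have more than 100 ratings (a threshold chosen from the histogram above). Correlations computed from only a few shared ratings are unreliable, so this filter gives more meaningful results. First, join the number of ratings onto the correlations: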



In [23]:

    
corr_starwars = corr_starwars.join(ratings['num of ratings'])
corr_starwars.head()

    Out[23]:

                           Correlation  num of ratings
title
'Til There Was You (1997)     0.872872               9
1-900 (1994)                 -0.645497               5
101 Dalmatians (1996)         0.211132             109
12 Angry Men (1957)           0.184289             125
187 (1997)                    0.027398              41

Now filter on the number of ratings and sort by correlation:



In [24]:

    
corr_starwars[corr_starwars['num of ratings']>100].sort_values('Correlation',ascending=False).head()

    Out[24]:

                                                    Correlation  num of ratings
title
Star Wars (1977)                                       1.000000             584
Empire Strikes Back, The (1980)                        0.748353             368
Return of the Jedi (1983)                              0.672556             507
Raiders of the Lost Ark (1981)                         0.536117             420
Austin Powers: International Man of Mystery (1997)     0.377433             130

The same for the comedy Liar Liar:



In [25]:

    
corr_liarliar = pd.DataFrame(similar_to_liarliar,columns=['Correlation'])
corr_liarliar.dropna(inplace=True)
corr_liarliar = corr_liarliar.join(ratings['num of ratings'])
corr_liarliar[corr_liarliar['num of ratings']>100].sort_values('Correlation',ascending=False).head()

    Out[25]:

                       Correlation  num of ratings
title
Liar Liar (1997)          1.000000             485
Batman Forever (1995)     0.516968             114
Mask, The (1994)          0.484650             129
Down Periscope (1996)     0.472681             101
Con Air (1997)            0.469828             137
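
Since the same steps were repeated for both movies, they could be wrapped in a small function. The following is only a sketch: similar_movies, min_ratings and top_n are hypothetical names introduced here, and the demonstration runs on a made-up toy dataset rather than the real ratings.

```python
import pandas as pd

def similar_movies(moviemat, ratings, title, min_ratings=100, top_n=5):
    """Hypothetical helper: rank titles by rating correlation with `title`."""
    # Correlate every movie column with the chosen movie and drop undefined values.
    corr = moviemat.corrwith(moviemat[title]).dropna().to_frame('Correlation')
    # Attach the number of ratings so that rarely rated movies can be filtered out.
    corr = corr.join(ratings['num of ratings'])
    popular = corr[corr['num of ratings'] > min_ratings]
    return popular.sort_values('Correlation', ascending=False).head(top_n)

# Toy data (illustrative values): four users, three movies.
toy = pd.DataFrame({
    'user_id': [1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3],
    'title':   ['X', 'X', 'X', 'X', 'Y', 'Y', 'Y', 'Y', 'Z', 'Z', 'Z'],
    'rating':  [5, 4, 2, 1, 4, 5, 1, 2, 1, 2, 5],
})
toy_mat = toy.pivot_table(index='user_id', columns='title', values='rating')
toy_ratings = pd.DataFrame(toy.groupby('title')['rating'].mean())
toy_ratings['num of ratings'] = toy.groupby('title')['rating'].count()

# Keep only movies with more than 3 ratings in this tiny example.
result = similar_movies(toy_mat, toy_ratings, 'X', min_ratings=3)
print(result)
```

On the real data, a call such as similar_movies(moviemat, ratings, 'Star Wars (1977)') would be expected to reproduce the filtered table shown earlier, with the chosen movie itself in first place.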

Out[5] (result of df.head() on the raw ratings, before merging):

   user_id  item_id  rating  timestamp
0        0       50       5  881250949
1        0      172       5  881250949
2        0      133       1  881250949
3      196      242       3  881250949
4      186      302       3  891717742
